How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. (arXiv:2305.00586v5 [cs.CL] UPDATED)
Pre-trained language models can be surprisingly adept at tasks they were not
explicitly trained on, but how they implement these capabilities is poorly
understood. In this paper, we investigate the basic mathematical abilities
often acquired by pre-trained language models. Concretely, we use mechanistic
interpretability techniques to explain the (limited) mathematical abilities of
GPT-2 small. As a case study, we examine its ability to take in sentences such
as "The war lasted from the year 1732 to the year 17", and predict valid
two-digit end years (years > 32). We first identify a circuit, a small subset
of GPT-2 small's computational graph that computes this task's output. Then, we
explain the role of each circuit component, showing that GPT-2 small's final
multi-layer perceptrons boost the probability of end years greater than the
start year. Finally, we find related tasks that activate our circuit. Our
results suggest that GPT-2 small computes greater-than using a complex but
general mechanism that activates across diverse contexts.
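
To make the task concrete, the following is a minimal sketch of how one might score a next-token distribution on the example prompt, where the start year ends in 32 and any completion above 32 counts as valid. It is an illustration under stated assumptions, not the paper's evaluation code: the function names `softmax` and `greater_than_score`, the representation of the model output as a list of 100 values indexed by two-digit year, and the toy logits are all hypothetical, and no actual GPT-2 model is loaded.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greater_than_score(probs, start_yy):
    """Probability mass on valid end years minus mass on invalid ones.

    Assumption: probs[y] is the probability assigned to the two-digit
    completion y (0..99), and a completion is valid when y > start_yy.
    The score lies in [-1, 1]; higher means the distribution favors
    end years greater than the start year.
    """
    if len(probs) != 100:
        raise ValueError("expected one probability per two-digit year")
    if not 0 <= start_yy <= 99:
        raise ValueError("start_yy must be in 0..99")
    valid = sum(probs[start_yy + 1:])     # years strictly greater than start
    invalid = sum(probs[:start_yy + 1])   # years less than or equal to start
    return valid - invalid


# Toy distributions standing in for model output (hypothetical values).
start_yy = 32
uniform = softmax([0.0] * 100)
boosted = softmax([2.0 if y > start_yy else 0.0 for y in range(100)])

uniform_score = greater_than_score(uniform, start_yy)
boosted_score = greater_than_score(boosted, start_yy)
print(f"uniform: {uniform_score:.3f}, boosted: {boosted_score:.3f}")
```

A uniform distribution already scores above zero here because 67 of the 100 completions exceed 32, so a useful comparison is against such a baseline rather than against zero. The toy "boosted" distribution mimics the qualitative effect described above, in which later components raise the probability of end years greater than the start year.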